Predicting Future Scientific Research with Knowledge Graphs
Matheus Schmitz¹ – mschmitz@usc.edu
Rehan Ahmed¹ – rehanahm@usc.edu
¹University of Southern California
1 Introduction
The rise of Data Science as an academic discipline stems at least in part from its effectiveness as a force multiplier: it can enhance many pre-existing data-dependent approaches to problem solving, knowledge discovery and comprehension, and predictive systems, among others.
Looking to leverage this strength, this project was devised with
the intent of combining data science techniques, primarily
Knowledge Graphs and Artificial Intelligence, in the pursuit of
uncovering trends and areas of interest for future scientific
research.
Given the volume of academic papers published every year, a
subject matter for the development of the predictive system
had to be decided upon. The authors saw no more suitable
choice than Artificial Intelligence itself.
2 Data Gathering
The data used throughout this project was obtained from four
sources, all of which are well-regarded repositories of
academic research, some wide-encompassing and others
focused on topics pertinent to computer science. Those
sources are: Google Scholar, arXiv, Semantic Scholar, and dblp.
Data gathering was accomplished with a combination of web
scraping tools (Scrapy) and APIs (Semantic Scholar, dblp).
Multiple sources were necessary because each offers complementary capabilities. Google Scholar allows the mining of highly relevant papers but does not expose their abstracts, while arXiv provides no relevance ranking but does provide abstracts and supports lookup by title, making it a perfect complement to the papers obtained from Google Scholar.
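The title-based arXiv lookup described above can be sketched with the public arXiv export API; the helper name below is ours and the project's actual crawler details are not shown in the text:

```python
from urllib.parse import quote_plus

def arxiv_query_url(title: str, max_results: int = 1) -> str:
    """Build an arXiv API query URL that searches by title, so a
    paper found on Google Scholar can be matched to its abstract."""
    base = "http://export.arxiv.org/api/query"
    query = quote_plus(f'ti:"{title}"')
    return f"{base}?search_query={query}&max_results={max_results}"

print(arxiv_query_url("Attention Is All You Need"))
```

Fetching the resulting URL returns an Atom feed whose entries carry the paper's abstract in their summary field.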
Semantic Scholar, in turn, has an API that vastly simplifies large-volume data crawling. The API's JSON-formatted responses enabled the creation of a “seed trail” crawler, which
iteratively extracts papers based on the citations and
references of previously crawled papers. This feature was
critical for this project, as given the total size of the published
data on the internet, if one were to randomly crawl papers, the
resulting graph would be very sparsely connected. The “seed
trail” crawling approach creates a strongly connected graph,
enabling many of the further steps of this project.
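The “seed trail” crawl amounts to a breadth-first traversal over citation and reference links. A minimal sketch, with the network call abstracted into a `fetch_paper` callable (assumed here to return a dict with `citations` and `references` id lists, as parsed from the Semantic Scholar API's JSON):

```python
from collections import deque

def seed_trail_crawl(seed_ids, fetch_paper, limit=1000):
    """Breadth-first crawl over citation/reference links starting
    from a set of seed papers, up to `limit` papers."""
    seen = set(seed_ids)
    queue = deque(seed_ids)
    papers = {}
    while queue and len(papers) < limit:
        pid = queue.popleft()
        paper = fetch_paper(pid)
        papers[pid] = paper
        # Follow both citations and references of the crawled paper.
        for nxt in paper.get("citations", []) + paper.get("references", []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return papers
```

Because every crawled paper is reachable from the seeds through citation links, the resulting graph is strongly connected by construction, unlike one built from randomly sampled papers.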
dblp offers a simple DOI lookup tool, which simplified the task of gathering complementary information for the papers crawled from Semantic Scholar.
3 Entity Resolution
Given the various sources used, entity resolution was
performed in three steps. First, Google Scholar and arXiv
were merged using entity linkage functions that considered
title, author, and date. Second, Semantic Scholar and dblp
were merged using DOI. Third, the two previous merges were
combined into one based on title, author, and DOI. Performing
blocking by author achieved a reduction ratio of 0.93, shrinking
a total of 43 million pairs into 3 million pairs for comparison.
Entity resolution resulted in 20,534 unique papers.
Using a validation set, the authors found the True Positive Rate
(TPR) of the entity linkage to be 99.53%.
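Blocking by author can be sketched as grouping records by a normalized author key and only comparing pairs within a block; the record fields and normalization below are illustrative, not the project's exact code:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records):
    """Block records by normalized author name so that only pairs
    sharing an author are compared, instead of all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["author"].strip().lower()].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)

# Reduction ratio as reported above: 43M candidate pairs cut to 3M.
reduction_ratio = 1 - 3e6 / 43e6
print(round(reduction_ratio, 2))  # 0.93
```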
4 External Knowledge Graph Linkage
Papers available on the aforementioned sources have no
additional information about the authors other than their
names. In order to complement that data, the crawled dataset
was linked to DBpedia through programmatic queries to its API, which, thanks to the existing property dbo:academicDiscipline, enabled a high-precision (albeit low-recall, with only 298 matches) extraction of information about the papers' authors. This data was appended to the
dataset.
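The DBpedia lookup can be sketched as a SPARQL query keyed on the author's name and filtered through dbo:academicDiscipline; the exact query used in the project is not shown in the text, so the shape below is an assumption:

```python
PREFIXES = """\
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
"""

def build_author_query(name: str) -> str:
    """SPARQL query for the public DBpedia endpoint that finds a
    person by name along with their academic discipline."""
    return PREFIXES + f"""
SELECT ?person ?discipline ?birthDate WHERE {{
  ?person foaf:name "{name}"@en ;
          dbo:academicDiscipline ?discipline .
  OPTIONAL {{ ?person dbo:birthDate ?birthDate }}
}}
"""

print(build_author_query("Yoshua Bengio"))
```

Restricting matches to entities that actually have dbo:academicDiscipline is what yields high precision at the cost of recall: only people DBpedia already describes as academics can match.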
5 Triple Generation & Ontology
Using Python's RDFLib, the dataset was converted into a set of RDF triples in the Turtle format. All entity types, predicates, and properties are derived from schema.org, the XML, XSD, and FOAF vocabularies, and DBpedia.
The resulting ontology contains 4 entity types: Scholarly
Article, Person, Genre, and Publisher.
Connecting those entities are 6 predicate (relationship) types:
creditText, reference, citation, author, genre, and publisher.
The ontology uses a total of 12 property types: name,
birthdate, DBpedia URL, PageRank, headline, abstract, URL,
DOI, influentialCitationCount, citationVelocity, dateModified,
and datePublished.
The main entity in the resulting RDF ontology is
ScholarlyArticle, which contains the 20.5k academic articles
crawled from the web with their full data, and another 435k
articles which were mentioned as references but not crawled.
A URI was also created for each of the references in each of
the crawled papers to enable easy extensions of the graph.
Simple statistics about the triple generation process are as
follows: 453k paper URIs (including references), 51k author
URIs, 207 genres, 86 publishers, 1.5M relationships, 505k
nodes, and 2.65M triples.
The Turtle RDF was deployed to a Neo4j graph database and
made available via an endpoint on a Flask user interface.
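The Flask endpoint can be sketched as below; the route name, Cypher query, and stubbed driver call are our assumptions, since the text does not show the interface code (in the real deployment `run_cypher` would call the Neo4j Python driver):

```python
from flask import Flask, jsonify

app = Flask(__name__)

def run_cypher(query: str):
    # Stub standing in for a Neo4j driver session.run(query) call,
    # so the endpoint shape can be shown self-contained.
    return [{"title": "Attention Is All You Need"}]

@app.route("/papers")
def papers():
    # Illustrative Cypher over the deployed graph's article nodes.
    rows = run_cypher(
        "MATCH (p:ScholarlyArticle) RETURN p.headline AS title LIMIT 10"
    )
    return jsonify(rows)
```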